Abstract - Support vector machines (SVMs) are invaluable
tools for many practical applications in artificial intelligence,
e.g., classification and event recognition. However, popular
SVM solvers are not sufficiently efficient for applications with
a large number of samples as well as a large number of features.
In this paper, we therefore present NESVM, a fast gradient SVM
solver that can optimize various SVM models, e.g., classical
SVM, linear programming SVM and least square SVM.
Compared against SVM-Perf [1][2] (whose convergence rate in
solving the dual SVM is upper bounded by $O(1/\sqrt{k})$, where
$k$ is the number of iterations) and Pegasos [3] (an online SVM
solver that converges at rate $O(1/k)$ for the primal SVM), NESVM
achieves the optimal convergence rate of $O(1/k^2)$ and a linear
time complexity. In particular, NESVM smoothes the
non-differentiable hinge loss and $\ell_1$-norm in the primal SVM. Then
the optimal gradient method without any line search is adopted
to solve the optimization. In each iteration round, the current
gradient and historical gradients are combined to determine the
descent direction, while the Lipschitz constant determines the
step size. Only two matrix-vector multiplications are required
in each iteration round. Therefore, NESVM is more efficient
than existing SVM solvers. In addition, NESVM is available for
both linear and nonlinear kernels. We also propose "homotopy
NESVM" to accelerate NESVM by dynamically decreasing the
smooth parameter and using the continuation method. Our
experiments on census income categorization, indoor/outdoor
scene classification, event recognition and scene recognition
suggest the efficiency and the effectiveness of NESVM. The
MATLAB code of NESVM will be available on our website for
further assessment.

Keywords - Support vector machines; smooth; hinge loss; $\ell_1$-norm;
Nesterov's method; continuation method

I. Introduction 

Support Vector Machines (SVMs) are prominent machine
learning tools for practical artificial intelligence applications
[4][5]. However, existing SVM solvers are not sufficiently
efficient for practical problems, e.g., scene classification
and event recognition, with a large number of training
samples as well as a large number of features. This is because
the time cost of working set selection or Hessian matrix
computation in conventional SVM solvers increases rapidly
with even a slight growth of the data size and the feature
dimension. In addition, they cannot converge quickly to the
global optimum. Recently, efficient SVM solvers have been
intensively studied on both dual and primal SVMs.

Decomposition methods, e.g., sequential minimal optimization
(SMO) [6], LIBSVM [7] and SVM-Light [8], were
developed to reduce the space cost for optimizing the dual
SVM. In each iteration round, they consider a subset of
constraints that are relevant to the current support vectors
and optimize the corresponding dual problem on the selected
working set by casting it into a quadratic programming (QP)
problem. However, they are impractical for large scale
problems, because their time complexities are super linear in
$n$ and the maximization of the dual objective function leads
to a slow convergence rate to the optimum of the primal
objective function.

Structural SVM, e.g., SVM-Perf [1][2][9], was recently
proposed to improve the efficiency of optimization on the dual
SVM. It reformulates the classical SVM into a structural
form. In each iteration round, it first computes the most
violated constraint from the training set by using a cutting-plane
algorithm and adds this constraint to the current
working set; then a QP solver is applied to optimize the
corresponding dual problem. The Wolfe dual of structural
SVM is sparse and thus the size of each QP problem is
small. It has been proved that the convergence rate of SVM-Perf
is upper bounded by $O(1/\sqrt{k})$, and many successful
applications show the efficiency of SVM-Perf. However, it
cannot work well when classes are difficult to separate,
e.g., when the overlap between classes is serious or the distributions
of classes are seriously imbalanced. In this scenario, a large
$C$ is required, which increases the number of support vectors, and thus it is
inefficient to find the most violated constraint in SVM-Perf.

Many recent research results [10] show the advantages of
solving the primal SVM on large scale datasets. However,
it is inefficient to directly solve the corresponding QP of
the primal SVM, e.g., by the interior point method, if the
number of constraints is around the number of samples.
One available solution is to move
each of the constraints into the objective function as a hinge
loss $h$ and reformulate the problem as an unconstrained one.

Let $X \in \mathbb{R}^{n\times p}$ and $y \in \mathbb{R}^{n}$ be the training dataset
and the corresponding label vector, respectively, where the
vector $X_i \in \mathbb{R}^{p}$ is the $i$-th sample in $X$ and $y_i \in \{1,-1\}$
is the corresponding label. Let the weight vector $w$ be the
classification hyper-plane. The reformulated primal problem
of classical SVM is given by

$$\min_{w\in\mathbb{R}^p} F(w) = \frac{1}{2}\|w\|_2^2 + C\sum_{i=1}^{n} h(y_iX_i, w), \tag{1}$$

$$h(y_iX_i, w) = \max\{0,\ 1 - y_iX_iw\}. \tag{2}$$
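To make Eqs. (1) and (2) concrete, the following minimal Python sketch evaluates the primal objective for a small dense dataset stored as nested lists. The function names `hinge` and `primal_objective` are illustrative choices made here and are not taken from the authors' code.

```python
def hinge(margin):
    """Hinge loss h = max{0, 1 - margin}, where margin = y_i * <X_i, w> (Eq. (2))."""
    return max(0.0, 1.0 - margin)


def primal_objective(X, y, w, C):
    """F(w) = 0.5 * ||w||_2^2 + C * sum_i h(y_i X_i, w), as in Eq. (1).

    X is a list of n samples (each a list of p floats), y a list of labels
    in {+1, -1}, w a list of p weights and C the SVM trade-off parameter.
    """
    regularizer = 0.5 * sum(wj * wj for wj in w)
    loss = 0.0
    for xi, yi in zip(X, y):
        margin = yi * sum(a * b for a, b in zip(xi, w))
        loss += hinge(margin)
    return regularizer + C * loss
```

For example, `primal_objective([[1.0, 0.0], [0.0, 1.0]], [1, -1], [0.5, 0.5], 2.0)` adds the regularizer 0.25 to twice the summed hinge losses 0.5 and 1.5.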



Since the hinge loss is non-differentiable, first order
methods, e.g., the subgradient method and the stochastic gradient
method, can achieve the solution only with the convergence
rate $O(1/\sqrt{k})$, which is not sufficiently fast for large-scale
problems. Second order methods, e.g., the Newton method and the
Quasi-Newton method, can obtain the solution as well by
replacing the hinge loss with differentiable approximations,
e.g., $\max\{0, 1 - y_iX_iw\}^2$ used in [10] or the integral of the
sigmoid function used in [11]. Although the second order
methods achieve the optimal convergence rate of $O(1/k^2)$, it
is expensive to calculate the Hessian matrix in each iteration
round. Therefore, it is impractical to optimize the primal
SVM by using the second order methods.

Recently, Pegasos [3], a first order online method, was
proposed by introducing a projection step after each stochastic
gradient update. It converges at rate $O(1/k)$. In addition,
its computational cost can be rapidly reduced if the
features are sparse, because the computation of the primal
objective gradient can be significantly simplified. Therefore,
it has been successfully applied to document classification.
However, it hardly outperforms SVM-Perf when the features
are dense, which is a frequently encountered situation in
artificial intelligence, e.g., in computer vision tasks.

In this paper, we present and analyze a fast gradient
SVM framework, i.e., NESVM, which can solve the primal
problems of typical SVM models, i.e., classical SVM (C-SVM)
[12], linear programming SVM (LP-SVM) [13] and
least square SVM (LS-SVM) [14], with the proved optimal
convergence rate $O(1/k^2)$ and a linear time complexity.
The "NES" in NESVM refers to Nesterov's method, to
acknowledge the fact that NESVM is based on this method.
Recently, Nesterov's method has been successfully applied
to various optimization problems [15][16], e.g., compressive
sensing, sparse covariance selection, sparse PCA and matrix
completion. The proposed NESVM smoothes the
non-differentiable parts, i.e., the hinge loss and the $\ell_1$-norm in the primal
objective functions of SVMs, and then uses a gradient-based
method with the proved optimal convergence rate to
solve the smoothed optimizations. In each iteration round,
two auxiliary optimizations are constructed, and a weighted
combination of their solutions, which is determined by the
current gradient and the historical gradients, is assigned as the
current SVM solution. In each iteration round, only two
matrix-vector multiplications are required. Both linear and
nonlinear kernels can be easily applied to NESVM. The
speed of NESVM remains fast when dealing with dense
features.

We apply NESVM to census income categorization [17],
indoor scene classification [18], outdoor scene classification
[19] and event recognition [20] on publicly available
datasets. In these applications, we compare NESVM against
four popular SVM solvers, i.e., SVM-Perf, Pegasos, SVM-Light
and LIBSVM. The experimental results indicate
that NESVM achieves the shortest CPU time and a comparable
performance among all the SVM solvers.



II. NESVM 

We write typical primal SVMs in the following unified
form:

$$\min_{w\in\mathbb{R}^p} F(w) = R(w) + C \cdot L(y_iX_i, w), \tag{3}$$

where $R(w)$ is a regularizer inducing the margin maximization
in SVMs, $L(y_iX_i, w)$ is a loss function for minimizing
the classification error, and $C$ is the SVM parameter. For
example, $R(w)$ is the $\ell_2$-norm of $w$ in C-SVM, $R(w)$ is
the $\ell_1$-norm of $w$ in LP-SVM, $L(y_iX_i, w)$ is the hinge loss
of the classification error in C-SVM, and $L(y_iX_i, w)$ is the
least square loss of the classification error in LS-SVM. It is
worth emphasizing that nonlinear SVMs can be unified as
Eq. (3) as well. Details are given at the end of Section II-C.

In this section, we introduce and analyze the proposed fast
gradient SVM framework, i.e., NESVM, based on Eq. (3). We
first show that the non-differentiable parts in SVMs, i.e.,
the hinge loss and the $\ell_1$-norm, can be written as saddle
point functions and smoothed by subtracting respective
prox-functions. We then introduce Nesterov's method [21] to
optimize the smoothed SVM objective function $F_\mu(w)$.
In each iteration round of NESVM, two simple auxiliary
optimizations are constructed and the optimal linear
combination of their solutions is adopted as the solution
of Eq. (3) at the current iteration round. NESVM is a first
order method and achieves the optimal convergence rate
of $O(1/k^2)$. In each iteration round, it requires only two
matrix-vector multiplications. We analyze the convergence
rate and time complexity of NESVM theoretically. An accelerated
NESVM using the continuation method, i.e., "homotopy
NESVM", is introduced at the end of this section. Homotopy
NESVM solves a sequence of NESVM problems with a decreasing
smooth parameter to obtain an accurate approximation of
the hinge loss. The solution of each NESVM is used as
the "warm start" of the next NESVM in homotopy NESVM,
and thus the computational time of each NESVM is
significantly reduced.

A. Smooth the hinge loss 

In C-SVM and LP-SVM, the loss function is given by
the sum of all the hinge losses, i.e., $L(y_iX_i, w) = \sum_{i=1}^{n} h(y_iX_i, w)$,
which can be equivalently replaced by
the following saddle point function,

$$\min_{w\in\mathbb{R}^p} \sum_{i=1}^{n} h(y_iX_i, w) = \min_{w\in\mathbb{R}^p} \max_{u\in Q} \langle e - YXw,\ u\rangle, \qquad Q = \{u : 0 \le u_i \le 1,\ u \in \mathbb{R}^n\}, \tag{4}$$

where $e$ is a vector full of 1 and $Y = \mathrm{Diag}(y)$. According
to [21], the above saddle point function can be smoothed by
subtracting a prox-function $d_1(u)$. The $d_1(u)$ is a strongly
convex function of $u$ with a convexity parameter $\sigma_1 > 0$ and
the corresponding prox-center $u_0 = \arg\min_{u\in Q} d_1(u)$. Let
$A_i$ be the $i$-th row of the matrix $A$. We adopt
$d_1(u) = \frac{1}{2}\sum_{i=1}^{n} \|X_i\|_\infty u_i^2$ in NESVM and thus the smoothed hinge
loss $h_\mu$ can be written as

$$h_\mu = \max_{u\in Q}\ u_i(1 - y_iX_iw) - \frac{\mu}{2}\|X_i\|_\infty u_i^2, \tag{5}$$

where $\mu$ is the smooth parameter. Since $d_1(u)$ is strongly
convex, $u_i$ can be obtained by setting the gradient of the
objective function in Eq. (5) as zero and then projecting $u_i$ on
$Q$, i.e.,

$$u_i = \mathrm{median}\left\{\frac{1 - y_iX_iw}{\mu\|X_i\|_\infty},\ 0,\ 1\right\}. \tag{6}$$

Therefore, the smoothed hinge loss $h_\mu$ is a piece-wise
approximation of $h$ according to different choices of $u_i$ in
Eq. (6), i.e.,

$$h_\mu = \begin{cases} 0, & y_iX_iw > 1;\\ 1 - y_iX_iw - \frac{\mu}{2}\|X_i\|_\infty, & y_iX_iw < 1 - \mu\|X_i\|_\infty;\\ \dfrac{(1 - y_iX_iw)^2}{2\mu\|X_i\|_\infty}, & \text{else}. \end{cases} \tag{7}$$

Figure 1. Hinge loss and smoothed hinge loss (curves for $\mu$ = 0.5, 1, 2 and 4)

Fig. 1 plots the hinge loss $h$ and the smoothed hinge loss $h_\mu$
with different $\mu$. The figure indicates that a larger $\mu$
induces a smoother $h_\mu$ with a larger approximation error.
The following theorem shows the theoretical bound of the
approximation error.

Theorem 1. The hinge loss $h$ is bounded by its smooth approximation
$h_\mu$, and the approximation error is completely
controlled by the smooth parameter $\mu$. For any $w$, we have

$$h_\mu \le h \le h_\mu + \frac{\mu}{2}\|X_i\|_\infty. \tag{8}$$

According to Eq. (6) and Eq. (7), the gradient of $h_\mu$ for the
$i$-th sample is calculated as:

$$\frac{\partial h_\mu}{\partial w} = \begin{cases} 0, & u_i = 0;\\ -(y_iX_i)^T, & u_i = 1;\\ -\dfrac{(y_iX_i)^T(1 - y_iX_iw)}{\mu\|X_i\|_\infty}, & u_i = \dfrac{1 - y_iX_iw}{\mu\|X_i\|_\infty} \end{cases} \;=\; -(y_iX_i)^T u_i. \tag{9}$$

In NESVM, the gradient of $L(y_iX_i, w)$ is used to determine
the descent direction. Thus, the gradient of the sum of
the smoothed hinge losses is given by

$$\frac{\partial L(y_iX_i, w)}{\partial w} = \sum_{i=1}^{n} \frac{\partial h_\mu}{\partial w} = -(YX)^T u.$$

In NESVM, the Lipschitz constant of $L(y_iX_i, w)$ is used
to determine the step size of each iteration.

Definition 1. Given a function $f(x)$, for arbitrary $x^1$ and $x^2$, the
Lipschitz constant $L$ satisfies

$$\|\nabla f(x^1) - \nabla f(x^2)\|_2 \le L\|x^1 - x^2\|_2. \tag{10}$$

Thus the Lipschitz constant of $h_\mu$ can be calculated from

$$\max_{w^1, w^2} \frac{\|\nabla h_\mu(w^1) - \nabla h_\mu(w^2)\|_2}{\|w^1 - w^2\|_2} \le L_\mu. \tag{11}$$

According to Eq. (9), we have

$$\nabla h_\mu(w^1) - \nabla h_\mu(w^2) = \begin{cases} 0, & y_iX_iw > 1 \ \text{or}\ y_iX_iw < 1 - \mu\|X_i\|_\infty;\\ \dfrac{X_i^TX_i(w^1 - w^2)}{\mu\|X_i\|_\infty}, & \text{else}. \end{cases} \tag{12}$$

Thus,

$$\max_{w^1, w^2} \frac{\|\nabla h_\mu(w^1) - \nabla h_\mu(w^2)\|_2}{\|w^1 - w^2\|_2} \le \frac{\|X_i^TX_i\|_2}{\mu\|X_i\|_\infty}. \tag{13}$$

Hence the Lipschitz constant of $L(y_iX_i, w)$ (denoted as $L^\mu$)
is calculated as

$$L^\mu \le \frac{n}{\mu}\max_i \frac{\|X_i^TX_i\|_2}{\|X_i\|_\infty}. \tag{14}$$

B. Smooth the $\ell_1$-norm

In LP-SVM, the regularizer is defined by the sum of
all $\ell_1$-norm $\ell(w_i) = |w_i|$, i.e., $R(w) = \sum_{i=1}^{p} \ell(w_i)$. The
minimization of $R(w)$ can be equivalently replaced by the
following saddle point function,

$$\min_{w\in\mathbb{R}^p} \sum_{i=1}^{p} \ell(w_i) = \min_{w\in\mathbb{R}^p} \max_{u\in Q} \langle w, u\rangle, \qquad Q = \{u : -1 \le u_i \le 1,\ u \in \mathbb{R}^p\}. \tag{15}$$

The above saddle point function can be smoothed by subtracting
a prox-function $d_1(u)$. In this paper, we choose the
prox-function $d_1(u) = (1/2)\|u\|_2^2$ and thus the smoothed
$\ell_1$-norm $\ell_\mu$ can be written as

$$\ell_\mu(w_i) = \max_{u\in Q}\ w_iu_i - \frac{\mu}{2}u_i^2. \tag{16}$$

Since $d_1(u)$ is strongly convex, $u_i$ can be achieved by setting
the gradient of the objective function in Eq. (16) as zero and
then projecting $u_i$ on $Q$, i.e.,

$$u_i = \mathrm{median}\left\{\frac{w_i}{\mu},\ -1,\ 1\right\}, \tag{17}$$

where $u$ can also be explained as the result of a soft
thresholding of $w$. Therefore, the smoothed $\ell_1$-norm is a
piece-wise approximation of $\ell$, i.e.,

$$\ell_\mu(w_i) = \begin{cases} -w_i - \frac{\mu}{2}, & w_i < -\mu;\\ w_i - \frac{\mu}{2}, & w_i > \mu;\\ \dfrac{w_i^2}{2\mu}, & \text{else}. \end{cases}$$

Figure 2. $\ell_1$-norm and smoothed $\ell_1$-norm

Fig. 2 plots the $\ell_1$-norm $\ell$ and the smoothed $\ell_1$-norm $\ell_\mu$ with
different $\mu$. It shows that a larger $\mu$ induces a smoother $\ell_\mu$
with a larger approximation error. The following theorem
shows the theoretical bound of the approximation error.

Theorem 2. The $\ell_1$-norm $\ell$ is bounded by its smooth
approximation $\ell_\mu$, and the approximation error is completely
controlled by the smooth parameter $\mu$. For any $w$, we have

$$\ell_\mu \le \ell \le \ell_\mu + \frac{\mu}{2}. \tag{18}$$

In NESVM, the gradient of $R(w)$ is used to determine
the descent direction. Thus, the gradient of the sum of the
smoothed $\ell_1$-norm $\ell_\mu$ is

$$\frac{\partial \sum_{i=1}^{p} \ell_\mu(w_i)}{\partial w} = u. \tag{19}$$

In NESVM, the Lipschitz constant of $R(w)$ is used to
determine the step size of each iteration. According to
the definition of the Lipschitz constant and the second order
derivative of $\ell_\mu$, which satisfies

$$\frac{\partial^2 \ell_\mu(w_i)}{\partial w_i^2} \le \frac{1}{\mu}, \tag{20}$$

the Lipschitz constant of the sum of smoothed $\ell_1$-norm is
given by

$$L^\mu = \max_i \frac{\partial^2 \ell_\mu(w_i)}{\partial w_i^2} \le \frac{1}{\mu}. \tag{21}$$
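The two smoothing steps can be illustrated with a short Python sketch. It is a minimal per-sample and per-coordinate version under the prox-functions stated above; the names `smoothed_hinge` and `smoothed_l1` are hypothetical and chosen here for illustration. Each function returns the smoothed value together with the dual variable, which is what the gradient expressions $-(y_iX_i)^Tu_i$ and $u$ need.

```python
def smoothed_hinge(margin, mu, xnorm_inf):
    """Smoothed hinge loss for one sample.

    margin is y_i * X_i * w, mu the smooth parameter and xnorm_inf the value
    ||X_i||_inf (assumed > 0). Returns (h_mu, u_i).
    """
    a = 1.0 - margin
    s = mu * xnorm_inf
    u = min(1.0, max(0.0, a / s))            # median{a / s, 0, 1}
    return u * a - 0.5 * s * u * u, u        # objective evaluated at the maximiser


def smoothed_l1(w_j, mu):
    """Smoothed absolute value for one coordinate. Returns (l_mu, u_j)."""
    u = min(1.0, max(-1.0, w_j / mu))        # median{w_j / mu, -1, 1}
    return u * w_j - 0.5 * mu * u * u, u
```

Evaluating both functions on a grid of inputs is a quick way to check the approximation gaps stated in Theorems 1 and 2: the smoothed value never exceeds the original loss, and the difference stays below $\mu\|X_i\|_\infty/2$ and $\mu/2$, respectively.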



C. Nesterov's method for SVM

We apply Nesterov's method [21] to minimize the
smoothed primal SVM $F_\mu(w)$. It is a gradient method with
the proved optimal convergence rate $O(1/k^2)$. In its $k$-th
iteration round, two auxiliary optimizations are constructed
and their solutions are used to build the SVM solution at the
same iteration round. We use $w^k$, $y^k$ and $z^k$ to represent the
solutions of SVM and its two auxiliary optimizations at the
$k$-th iteration round, respectively. The Lipschitz constant of
$F_\mu(w)$ is $L_\mu$ and the two auxiliary optimizations are

$$\min_{y\in\mathbb{R}^p}\ \langle \nabla F_\mu(w^k),\ y - w^k\rangle + \frac{L_\mu}{2}\|y - w^k\|_2^2,$$

$$\min_{z\in\mathbb{R}^p}\ \frac{L_\mu}{\sigma_2} d_2(z) + \sum_{i=0}^{k} \frac{i+1}{2}\left[F_\mu(w^i) + \langle \nabla F_\mu(w^i),\ z - w^i\rangle\right].$$

We choose the prox-function $d_2(z) = \|z - w^\star\|_2^2/2$, whose
strong convexity parameter is $\sigma_2$, where $w^\star$ is the prox-center
and $\sigma_2 = 1$. The $w^\star$ is usually selected as a guess
solution of $w$.

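By directly setting the gradients of the two objective
functions in the auxiliary optimizations as zeros, we can
obtain $y^k$ and $z^k$, respectively,

$$y^k = w^k - \frac{1}{L_\mu}\nabla F_\mu(w^k), \tag{22}$$

$$z^k = w^\star - \frac{\sigma_2}{L_\mu}\sum_{i=0}^{k} \frac{i+1}{2}\nabla F_\mu(w^i). \tag{23}$$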
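As a short worked check of this step, the stationarity conditions can be written out explicitly. The derivation below is a sketch that uses only the two auxiliary objectives and the prox-function $d_2(z) = \|z - w^\star\|_2^2/2$ described above; the shorthand $\alpha_i = (i+1)/2$ is introduced here for brevity and is not notation from the paper.

```latex
\begin{align*}
\nabla_y\Big[\langle \nabla F_\mu(w^k),\, y - w^k\rangle + \tfrac{L_\mu}{2}\|y - w^k\|_2^2\Big]
  &= \nabla F_\mu(w^k) + L_\mu\,(y - w^k) = 0
  \;\Longrightarrow\; y^k = w^k - \tfrac{1}{L_\mu}\nabla F_\mu(w^k), \\
\nabla_z\Big[\tfrac{L_\mu}{2\sigma_2}\|z - w^\star\|_2^2
      + \sum_{i=0}^{k}\alpha_i\,\langle \nabla F_\mu(w^i),\, z - w^i\rangle\Big]
  &= \tfrac{L_\mu}{\sigma_2}(z - w^\star) + \sum_{i=0}^{k}\alpha_i\,\nabla F_\mu(w^i) = 0
  \;\Longrightarrow\; z^k = w^\star - \tfrac{\sigma_2}{L_\mu}\sum_{i=0}^{k}\alpha_i\,\nabla F_\mu(w^i).
\end{align*}
```

The constant terms $F_\mu(w^i)$ in the second objective do not depend on $z$ and therefore drop out of the gradient.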



We have the following interpretation of the above results.
The $y^k$ is a solution of the standard gradient descent with
step size $1/L_\mu$ at the $k$-th iteration round. The $z^k$ is a
solution of a gradient descent step that starts from the
guess solution $w^\star$ and proceeds along a direction determined
by the weighted sum of negative gradients in all
previous iteration rounds. The weights of gradients at later
iteration rounds are larger than those at earlier iteration
rounds. Therefore, $y^k$ and $z^k$ encode the current gradient
and historical gradients. In NESVM, their weighted sum
determines the SVM solution after the $k$-th iteration round,

$$w^{k+1} = \frac{2}{k+3}z^k + \frac{k+1}{k+3}y^k. \tag{24}$$

Let $\psi_k$ be the optimal objective value of the second auxiliary
optimization. According to [21], we arrive at the following
theorem.

Theorem 3. For any $k$ and the corresponding $y^k$, $z^k$ and
$w^{k+1}$ defined by Eq. (22), Eq. (23) and Eq. (24), respectively, we
have

$$\frac{(k+1)(k+2)}{4}F_\mu(y^k) \le \psi_k. \tag{25}$$

Theorem 3 is a direct result of Lemma 2 in [21] and it
will be applied to analyze the convergence rate of NESVM.



Algorithm 1 NESVM

Input: $YX$, $w^0$, $w^\star$, $C$, $\mu$ and $\epsilon$
Output: weight vector $w$
Initialize: $k = 0$
repeat
    Step 1: Compute dual variable $u$
    Step 2: Compute gradient $\nabla F_\mu(w^k)$
    Step 3: Compute $y^k$ and $z^k$ using Eq. (22) and Eq. (23)
    Step 4: Update SVM solution $w^{k+1}$ using Eq. (24)
    Step 5: $k = k + 1$
until $|F_\mu(w^{k+1}) - F_\mu(w^k)| < \epsilon$
return $w$



A small smooth parameter $\mu$ can improve the accuracy of
the smooth approximation. A better guess solution $w^\star$ that
is close to the real one can improve the convergence rate
and reduce the training time.

Algorithm 1 details the procedure of NESVM. In particular,
the input parameters are the matrix $YX$, the initial solution
$w^0$, the guess solution $w^\star$, the parameter $C$, the smooth
parameter $\mu$ and the tolerance of the termination criterion $\epsilon$. In
each iteration round, the dual variable $u$ in the smooth parts
is first computed, then the gradient $\nabla F_\mu(w)$ is calculated
from $u$, $y^k$ and $z^k$ are calculated from the gradient, and
finally $w^{k+1}$ is updated at the end of the iteration round.
NESVM conducts the above procedure iteratively until the
convergence of $F_\mu(w)$.
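The procedure can be sketched in plain Python for the C-SVM case. This is an illustrative reading of Algorithm 1 rather than the authors' implementation: the function names, the list-based data layout, the `max_iter` safeguard and the use of $\|X_i^TX_i\|_2 = \|X_i\|_2^2$ for the Lipschitz constant of the smoothed C-SVM objective are assumptions made here, and every row of `YX` is assumed to be nonzero.

```python
def smoothed_csvm(YX, w, C, mu):
    """Return (F_mu(w), grad F_mu(w)) for the smoothed C-SVM objective."""
    value = 0.5 * sum(wj * wj for wj in w)
    grad = list(w)                                    # gradient of 0.5 * ||w||^2
    for row in YX:                                    # row = y_i * X_i
        c = max(abs(x) for x in row)                  # ||X_i||_inf, assumed > 0
        a = 1.0 - sum(x * wj for x, wj in zip(row, w))
        u = min(1.0, max(0.0, a / (mu * c)))          # Step 1: dual variable
        value += C * (u * a - 0.5 * mu * c * u * u)   # smoothed hinge loss
        for j in range(len(w)):                       # Step 2: add -C * (YX)^T u
            grad[j] -= C * u * row[j]
    return value, grad


def nesvm_csvm(YX, C, mu, w0, w_guess, eps=1e-6, max_iter=10000):
    """Sketch of Algorithm 1; w_guess plays the role of the prox-center."""
    n, p = len(YX), len(w0)
    ratio = max(sum(x * x for x in row) / max(abs(x) for x in row) for row in YX)
    L = 1.0 + C * n * ratio / mu                      # Lipschitz constant of F_mu
    w = list(w0)
    weighted = [0.0] * p                              # sum_i (i+1)/2 * grad F_mu(w^i)
    f_prev = None
    for k in range(max_iter):
        f, g = smoothed_csvm(YX, w, C, mu)
        if f_prev is not None and abs(f - f_prev) < eps:
            break                                     # objective has stopped changing
        f_prev = f
        y = [wj - gj / L for wj, gj in zip(w, g)]     # Step 3: gradient step
        weighted = [sj + 0.5 * (k + 1) * gj for sj, gj in zip(weighted, g)]
        z = [cj - sj / L for cj, sj in zip(w_guess, weighted)]  # sigma_2 = 1
        w = [(2.0 * zj + (k + 1) * yj) / (k + 3)      # Step 4: combine z^k and y^k
             for zj, yj in zip(z, y)]
    return w
```

A call such as `nesvm_csvm(YX, C=1.0, mu=0.1, w0=[0.0, 0.0], w_guess=[0.0, 0.0])` expects `YX` to hold the rows $y_iX_i$. No line search appears anywhere: the step size is fixed by `L`, and each iteration touches the data only through the products with `YX` inside `smoothed_csvm`.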

NESVM contains no expensive computations, e.g., line
search and Hessian matrix calculation. The main computational
costs are the two matrix-vector multiplications in Steps
1 and 2, i.e., $(YX)w$ and $(YX)^Tu$. Since most elements
of $u$ are 0 or 1 and the proportion of these elements
rapidly increases as $\mu$ decreases, the computation
of $(YX)^Tu$ can be further simplified. In addition, this
simplification indicates that the gradient of each iteration
round is completely determined by the support vectors in the
current iteration round. These support vectors correspond to
the nonzero elements in $u$.

The above algorithm can be conveniently extended to
nonlinear kernels by replacing the data matrix $X$ with
$K(X, X)Y$, where $K(X, X)$ is the kernel matrix, and
replacing the penalty $\|w\|_2^2$ with $w^TK(X, X)w$.

If a bias $b$ in an SVM classifier is required, let

$$w := [w;\ b] \quad \text{and} \quad X := [X,\ e]. \tag{26}$$

Since the $\ell_2$ norm of $b$ is not penalized in the original
SVM problem, we calculate the gradient of $b$ according to
$\partial F(w)/\partial b = \partial L(y_iX_i, w)/\partial b$ in Algorithm 1, and the last
entry of the output solution $w$ is the bias $b$.

D. Convergence Analysis 

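The following theorem shows the convergence rate, the
iteration number and the time complexity of NESVM.

Theorem 4. The convergence rate of NESVM is $O(1/k^2)$.
It requires $O(1/\sqrt{\epsilon})$ iteration rounds to reach an $\epsilon$ accurate
solution.

Proof: Let the optimal solution be $w^*$ (not to be confused with
the prox-center $w^\star$). Since $F_\mu(w)$ is
a convex function, we have

$$F_\mu(w^*) \ge F_\mu(w^i) + \langle \nabla F_\mu(w^i),\ w^* - w^i\rangle. \tag{27}$$

Thus,

$$\psi_k \le \frac{L_\mu}{\sigma_2}d_2(w^*) + \sum_{i=0}^{k}\frac{i+1}{2}\left[F_\mu(w^i) + \langle \nabla F_\mu(w^i),\ w^* - w^i\rangle\right] \tag{28}$$

$$\le \frac{L_\mu}{\sigma_2}d_2(w^*) + \sum_{i=0}^{k}\frac{i+1}{2}F_\mu(w^*) \tag{29}$$

$$= \frac{L_\mu}{\sigma_2}d_2(w^*) + \frac{(k+1)(k+2)}{4}F_\mu(w^*). \tag{30}$$

According to Theorem 3, we have

$$\frac{(k+1)(k+2)}{4}F_\mu(y^k) \le \psi_k \le \frac{L_\mu}{\sigma_2}d_2(w^*) + \frac{(k+1)(k+2)}{4}F_\mu(w^*). \tag{31}$$

Hence, with $\sigma_2 = 1$, the accuracy at the $k$-th iteration round is

$$F_\mu(y^k) - F_\mu(w^*) \le \frac{4L_\mu d_2(w^*)}{(k+1)(k+2)}. \tag{32}$$

Therefore, NESVM converges at rate $O(1/k^2)$, and the
minimum iteration number to reach an $\epsilon$ accurate solution
is $O(1/\sqrt{\epsilon})$. This completes the proof. ■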
According to the analysis in Section II-C, there are only
two matrix-vector multiplications in each iteration round of
NESVM. Thus, the time complexity of each iteration round
is $O(n)$. Combined with Theorem 4, the time needed to reach
an $\epsilon$ accurate solution is $O(n/\sqrt{\epsilon})$, i.e., linear in $n$.
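The iteration count implied by the bound in the proof can be computed directly. The helper below is a small sketch, assuming the bound $4L_\mu d_2(w^*)/((k+1)(k+2)) \le \epsilon$ with $\sigma_2 = 1$; the function name and arguments are illustrative and not part of the paper.

```python
import math


def iterations_for_accuracy(L_mu, d2_opt, eps):
    """Smallest k >= 0 such that 4 * L_mu * d2_opt / ((k + 1) * (k + 2)) <= eps.

    L_mu is the Lipschitz constant of the smoothed objective, d2_opt the value
    of the prox-function at the optimum and eps the target accuracy.
    """
    target = 4.0 * L_mu * d2_opt / eps
    # Positive root of k^2 + 3k + 2 = target, then corrected for rounding.
    k = max(0, math.ceil((math.sqrt(1.0 + 4.0 * target) - 3.0) / 2.0))
    while (k + 1) * (k + 2) < target:
        k += 1
    while k > 0 and k * (k + 1) >= target:
        k -= 1
    return k
```

Because the required $k$ grows like $\sqrt{L_\mu d_2(w^*)/\epsilon}$, shrinking `eps` by a factor of 100 raises the returned count by roughly a factor of 10, which is the $O(1/\sqrt{\epsilon})$ behaviour stated in Theorem 4.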

E. Accelerating NESVM with continuation 

The homotopy technique used in lasso [22] and LARS
[23] shows the advantages of the continuation method in speeding
up the optimization and solving large-scale problems. In the
continuation method, a sequence of optimization problems
with a decreasing parameter is solved until the preferred value
of the parameter is reached. The solution of each optimization
is used as the "warm start" for the next optimization. It has
been proved that the convergence rate of each optimization
is significantly accelerated by this technique, because only a
few steps are required to reach the solution if the optimization
starts from the "warm start".

In NESVM, a smaller smooth parameter $\mu$ is always
preferred because it produces a more accurate approximation
of the hinge loss or the $\ell_1$ norm. However, a small $\mu$ implies
a large $L_\mu$ according to Eq. (14) and Eq. (21), which induces a
slow convergence rate according to Eq. (32). Hence the time
cost of NESVM is high when a small $\mu$ is selected.



Algorithm 2 Homotopy NESVM

Input: $YX$, $w^0$, $\mu_0$, $C$, $\epsilon$ and $\mu^*$.
Output: weight vector $w$.
Initialize: $t = 0$.
repeat
    Step 1: Apply NESVM with $\mu = \mu_t$ and initial solution $w^t$
    Step 2: Update $\mu_t = \mu_0/(t + 1)$, and $t := t + 1$
until $\mu_t < \mu^*$
return $w = w^t$.



We apply the continuation method to NESVM and obtain
an accelerated algorithm termed "homotopy NESVM"
for the small $\mu$ situation. In homotopy NESVM, a series of
smoothed SVM problems with decreasing smooth parameter
$\mu$ are solved by using NESVM, and the solution of each
NESVM is used as the initial solution of the next
NESVM. The algorithm stops when the preferred $\mu = \mu^*$ is
reached. In this paper, homotopy NESVM starts from a large
$\mu_0$, and sets the smooth parameter $\mu$ at the $t$-th NESVM as

$$\mu_t = \frac{\mu_0}{t + 1}. \tag{33}$$



Because the smooth parameter $\mu$ used in each NESVM
is large and the "warm start" is close to the solution,
the computation of each NESVM's solution is cheap. In
practice, less accuracy is often allowed for each NESVM,
thus more computation can be saved. We show homotopy
NESVM in Algorithm 2. Notice that the Lipschitz constants in
Eq. (14) and Eq. (21) must be updated whenever the
smooth parameter $\mu$ is updated in Step 2.
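The continuation loop can be sketched independently of the inner solver. The Python code below assumes the schedule $\mu_t = \mu_0/(t+1)$ and, to stay short and self-contained, replaces the inner NESVM call with a plain gradient-descent stand-in on a one-dimensional smoothed problem (a single sample with $y_iX_i = 1$). Both function names and the toy problem are illustrative assumptions, not the authors' code.

```python
def solve_smoothed_1d(mu, w_init, C=1.0, tol=1e-10, max_iter=100000):
    """Stand-in inner solver: gradient descent on 0.5 * w^2 + C * h_mu(w)
    for one sample with y_i * X_i = 1, so that ||X_i||_inf = 1."""
    L = 1.0 + C / mu                              # Lipschitz constant; depends on mu
    w = w_init
    for _ in range(max_iter):
        u = min(1.0, max(0.0, (1.0 - w) / mu))    # dual variable of the smoothed hinge
        g = w - C * u                             # gradient of the smoothed objective
        if abs(g) < tol:
            break
        w -= g / L
    return w


def homotopy(solve, w0, mu0, mu_target):
    """Continuation loop: solve with decreasing mu, warm-starting each solve."""
    w, t, mu = w0, 0, mu0
    schedule = []
    while True:
        w = solve(mu, w)                          # Step 1: warm start from previous w
        schedule.append(mu)
        t += 1                                    # Step 2: shrink the smooth parameter
        mu = mu0 / (t + 1)
        if mu < mu_target:
            break
    return w, schedule
```

Calling `homotopy(solve_smoothed_1d, 0.0, 4.0, 0.5)` runs the inner solver for each value of the schedule down to 0.5 and returns the last solution together with the list of smooth parameters used. Note that the stand-in solver recomputes its Lipschitz constant from the current `mu` on every call, mirroring the remark that the constants must be updated together with the smooth parameter.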

III. Examples 

In this section, we apply NESVM to three typical SVM
models, i.e., classical SVM (C-SVM) [12], linear programming
SVM (LP-SVM) [13] and least square SVM (LS-SVM)
[14][24]. They share the unified form of Eq. (3) and have different
$R(w)$ and $L(y_iX_i, w)$. In NESVM, the solutions of C-SVM,
LP-SVM and LS-SVM differ in the calculation of
the gradient item $\nabla F_\mu(w^k)$ and the Lipschitz constant $L_\mu$.

A. C-SVM 

In C-SVM, the regularizer $R(w)$ in Eq. (3) is

$$R(w) = \frac{1}{2}\|w\|_2^2, \tag{34}$$

and the loss function $L(y_iX_i, w)$ is the sum of all the hinge
losses

$$L(y_iX_i, w) = \sum_{i=1}^{n} h(y_iX_i, w). \tag{35}$$

Therefore, the gradient $\nabla F_\mu(w^k)$ and the Lipschitz constant
$L_\mu$ in NESVM are

$$\nabla F_\mu(w^k) = w^k - C(YX)^Tu, \tag{36}$$

$$L_\mu = 1 + \frac{Cn}{\mu}\max_i \frac{\|X_i^TX_i\|_2}{\|X_i\|_\infty}, \tag{37}$$

where $u$ is the dual variable in the smoothed hinge loss and
can be calculated according to Eq. (6). Thus, C-SVM can be
solved by using Algorithm 1 and Algorithm 2.

B. LP-SVM 

In LP-SVM, the regularizer $R(w)$ in Eq. (3) is

$$R(w) = \|w\|_1, \tag{38}$$

and the loss function $L(y_iX_i, w)$ is the sum of all the hinge
losses

$$L(y_iX_i, w) = \sum_{i=1}^{n} h(y_iX_i, w). \tag{39}$$



Therefore, the gradient $\nabla F_\mu(w^k)$ and the Lipschitz constant
$L_\mu$ in NESVM are

$$\nabla F_\mu(w^k) = u - C(YX)^Tv, \tag{40}$$

$$L_\mu = \frac{1}{\mu} + \frac{Cn}{\nu}\max_i \frac{\|X_i^TX_i\|_2}{\|X_i\|_\infty}, \tag{41}$$

where $u$ and $v$ are the dual variables in the smoothed
$\ell_1$-norm and the smoothed hinge loss, and they can be
calculated according to Eq. (17) and Eq. (6), respectively; $\mu$ and
$\nu$ are the corresponding smooth parameters. They are both
updated according to Eq. (33) with different initial values in
homotopy NESVM. Thus LP-SVM can be solved by using
Algorithm 1 and Algorithm 2.
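The LP-SVM quantities can be assembled from the two smoothing steps as follows. This is a sketch under the assumption that the smoothed $\ell_1$-norm uses the parameter `mu` and the smoothed hinge loss uses the parameter `nu`, with every row of `YX` nonzero; the function name `lpsvm_smoothed` is hypothetical.

```python
def lpsvm_smoothed(YX, w, C, mu, nu):
    """Return (F, grad, L) for the smoothed LP-SVM objective.

    YX holds the rows y_i * X_i, w is the weight vector, mu smooths the
    l1-norm regularizer and nu smooths the hinge loss.
    """
    n, p = len(YX), len(w)
    u = [min(1.0, max(-1.0, wj / mu)) for wj in w]          # dual variable of the l1 term
    value = sum(uj * wj - 0.5 * mu * uj * uj for uj, wj in zip(u, w))
    grad = list(u)                                          # gradient of the smoothed l1-norm
    ratio = 0.0
    for row in YX:
        c = max(abs(x) for x in row)                        # ||X_i||_inf, assumed > 0
        a = 1.0 - sum(x * wj for x, wj in zip(row, w))
        v = min(1.0, max(0.0, a / (nu * c)))                # dual variable of the hinge term
        value += C * (v * a - 0.5 * nu * c * v * v)
        for j in range(p):
            grad[j] -= C * v * row[j]                       # accumulates -C * (YX)^T v
        ratio = max(ratio, sum(x * x for x in row) / c)     # ||X_i||_2^2 / ||X_i||_inf
    L = 1.0 / mu + C * n * ratio / nu                       # Lipschitz constant
    return value, grad, L
```

A simple sanity check for code of this kind is to compare `grad` with central finite differences of `value`; away from the breakpoints of the piece-wise definitions the two should agree to numerical precision.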

C. LS-SVM 

In LS-SVM, the regularizer $R(w)$ in Eq. (3) is

$$R(w) = \frac{1}{2}\|w\|_2^2, \tag{42}$$

and the loss function $L(y_iX_i, w)$ is the sum of all the
least square losses

$$L(y_iX_i, w) = \sum_{i=1}^{n} (1 - y_iX_iw)^2. \tag{43}$$

Since both the regularizer and the loss function are smooth,
the gradient item $\nabla F(w^k)$ and the Lipschitz constant $L$ in
NESVM are directly given by

$$\nabla F(w^k) = w^k - 2C(YX)^T(1 - YXw^k), \tag{44}$$

$$L = 1 + 2C\max_i\left\{\|X_i^TX_i\|_2\right\}. \tag{45}$$

Step 1 in Algorithm 1 is not necessary for LS-SVM, and
thus LS-SVM can be solved by using Algorithm 1 and
Algorithm 2.
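Since no dual variable is needed in the LS-SVM case, the objective and its gradient reduce to a few lines. The sketch below assumes the loss is the sum of squared residuals $1 - y_iX_iw$ with the regularizer $\frac{1}{2}\|w\|_2^2$; the function name is illustrative.

```python
def lssvm_value_and_grad(YX, w, C):
    """Return (F(w), grad F(w)) for LS-SVM with rows y_i * X_i stored in YX."""
    # Residual vector 1 - YXw, one entry per sample.
    residual = [1.0 - sum(x * wj for x, wj in zip(row, w)) for row in YX]
    value = 0.5 * sum(wj * wj for wj in w) + C * sum(r * r for r in residual)
    # Gradient w - 2C * (YX)^T (1 - YXw), computed coordinate by coordinate.
    grad = [w[j] - 2.0 * C * sum(r * row[j] for r, row in zip(residual, YX))
            for j in range(len(w))]
    return value, grad
```

Because the objective is an exact quadratic, a central finite-difference check of `grad` against `value` should match up to rounding error, which makes this case a convenient test bed before moving to the smoothed models.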



IV. Experiments 

In the following experiments, we demonstrate the efficiency
and effectiveness of the proposed NESVM by
applying it to census income categorization and several computer
vision tasks, i.e., indoor/outdoor scene classification,
event recognition and scene recognition. We implemented
NESVM in C++ and ran all the experiments on a 3.0GHz
Intel Xeon processor with 32GB of main memory under
Windows Vista. We analyzed its scaling behavior and its
sensitivity to $C$ and the size of the dataset. Moreover, we compared
NESVM against four benchmark SVM solvers, i.e.,
SVM-Perf^1, Pegasos^2, SVM-Light^3 and LIBSVM^4. The
tolerance used in stopping criteria of all the algorithms is set 
to 10~^. For all experiments, different SVM solvers obtained 
similar classification accuracies and performed comparably
to the results reported in the respective publications. Their
efficiencies are evaluated by the training time in CPU seconds.
All the SVM solvers are tested 10 times on 7 different $C$ values, i.e.,
$\{10^{-3}, 10^{-2}, 10^{-1}, 1, 10^{1}, 10^{2}, 10^{3}\}$. We show
their mean training time in the following analysis. In the first
experiment, we also test the SVM solvers on 6 subsets with
different sizes.

Five experiments are exhibited, i.e., census income categorization,
indoor scene classification, outdoor scene classification,
event recognition and scene recognition. Linear C-SVM
models are adopted in the first experiment to train binary
classifiers. Nonlinear C-SVM models with the RBF kernel
$\exp(-\|X_i - X_j\|_2^2/p)$ are adopted in the remaining experiments, wherein
$p$ is the number of features. For multiclass classification
tasks, the one-versus-one method was adopted. Pegasos is
compared with NESVM only in the first experiment, because
its code is only available for linear C-SVM. In all the
experiments, we set the initial solution $w^0 = 0$, the guess
solution $w^\star = 0$, the smooth parameter $\mu = 5$ and the
tolerance of termination criterion e = 10~^ in NESVM. 

A. Census income categorization 

We consider the census income categorization on the
Adult dataset from the UCI machine learning repository [17].
The Adult dataset contains 123 dimensional census data of 48842
Americans. The samples are separated into two classes
according to whether their income exceeds $50K/yr or
not. Table I shows the number of training samples and the
number of test samples in each subset.

Fig. 3 shows the scalability of the five SVM solvers on
the different $C$ values. The training time of NESVM and
Pegasos is slightly longer than that of the other SVM solvers for
small $C$ and shorter for large $C$. In addition,
NESVM and Pegasos are least sensitive to $C$, because the
search of the most violated constraint in SVM-Perf, and the
working set selection in SVM-Light and LIBSVM will be
evidently slowed when $C$ is augmented. However, the main
computations of NESVM and Pegasos are irrelevant to $C$.

^1 http://svmlight.joachims.org/svm_perf.html
^2 http://www.cs.huji.ac.il/~shais/code/index.html
^3 http://svmlight.joachims.org
^4 http://www.csie.ntu.edu.tw/~cjlin/libsvm/

Table I
Six subsets in the census income dataset.

Set ID          1       2       3       4       5       6
Training set    1605    2265    3185    4781    6414    11220
Test set        30956   30296   29376   27780   26147   21341

Figure 3. Time cost vs C in census income categorization

Fig. 4 shows the scalability of the five SVM solvers on
subsets with increasing sizes. NESVM achieves the shortest
training time when the number of training samples is less
than 5000. Moreover, NESVM and Pegasos are least sensitive
to the data size among all the SVM solvers. Pegasos
achieves a shorter training time when the number of training
samples is more than 10000. This is because NESVM is a
batch method while Pegasos is an online learning method.

B. Indoor scene classification 

We apply NESVM to indoor scene classification on the
dataset proposed in [18]. The minimum resolution of all
images in the smallest axis is 200 pixels. The sample
images are shown in Fig. 5. We choose a subset of the dataset
by randomly selecting 1000 images from each of the five
given groups, i.e., store, home, public spaces, leisure and
working place. Gist features of 544 dimensions composed
of color, texture and intensity are extracted to represent
images. In our experiment, 70% of the data are randomly selected
for training, and the rest for testing.

Fig. 6 shows the scalability of four SVM solvers on the
different $C$ values. NESVM achieves the shortest training
time on different $C$ among all the SVM solvers, because
NESVM obtains the optimal convergence rate $O(1/k^2)$ in its
gradient descent. LIBSVM has the most expensive time cost
among all the SVM solvers. In addition, NESVM is least
sensitive to $C$, because the main calculations of NESVM,
i.e., the two matrix-vector multiplications, are irrelevant to
$C$. SVM-Light is most sensitive to $C$. SVM-Perf is not
shown in Fig. 6 because its training time is much longer than that of
the other SVM solvers on all the $C$ values (more than 1000 CPU
seconds).

Figure 6. Time cost vs C in indoor scene classification

C. Outdoor scene classification

We apply NESVM to outdoor scene classification on the
dataset proposed in [19]. It contains 13 classes of natural
scenes, e.g., highway, inside of cities and office. The sample
images are shown in Fig. 7. Each class includes 200-400
images, and we split the images into 70% training samples and
30% test samples. The average image size is 250 x 300
pixels. Gist features of 352 dimensions composed of texture
and intensity are extracted to represent grayscale images.

Fig. 8 shows the scalability of the four SVM solvers on
the different $C$ values. NESVM is more efficient than SVM-Light
and LIBSVM. SVM-Perf took more than 100 CPU seconds
on each $C$, so we do not show it.

Figure 7. Sample images of outdoor scene dataset

Figure 8. Time cost vs C in outdoor scene classification

D. Event recognition 

We apply NESVM to event recognition on the dataset
proposed in [20]. It contains 8 classes of sports events, e.g.,
bocce, croquet and rock climbing. The size of each class
varies from 137 to 250. The sample images are shown in
Fig. 9. Bag of words features of 300 dimensions are extracted
according to [20]. We split the dataset into 70% training
samples and 30% test samples.

Fig. 10 shows the scalability of the four SVM solvers on
the different $C$ values. NESVM achieves the shortest training
time on different $C$ among all the SVM solvers. SVM-Light
and LIBSVM have similar CPU seconds, because
both of them are based on SMO. SVM-Perf has the most
expensive time cost on different $C$, because the advantages of
the cutting-plane algorithm used in SVM-Perf are weakened
in the nonlinear kernel situation. NESVM and LIBSVM are
less sensitive to $C$ than SVM-Perf and SVM-Light.

Figure 5. Sample images of indoor scene dataset

Figure 9. Sample images of event dataset

Figure 10. Time cost vs C in event recognition

E. Scene recognition 

We apply NESVM to scene recognition on the dataset
proposed in [25]. It contains 6 classes of images, i.e., event,
program, scene, people, objects and graphics. We randomly
select 10000 samples from the scene class and 10000
samples from the other classes and obtain a dataset with
20000 samples. Bag of words features of 500 dimensions
are extracted according to [25]. We split the dataset into
50% training samples and 50% test samples.

Figure 11. Time cost vs C in scene recognition

Fig. 11 shows the scalability of the four SVM solvers
on the different $C$ values. NESVM achieves the shortest
training time on different $C$ among all the SVM solvers. The
training times of SVM-Light and LIBSVM increase similarly
as $C$ grows, because both of them are based on
SMO. NESVM and SVM-Perf are less sensitive to $C$ than
LIBSVM and SVM-Light in this binary classification.

V. Conclusion 

This paper presented NESVM to solve the primal SVMs,
e.g., classical SVM, linear programming SVM and least
square SVM, with the optimal convergence rate $O(1/k^2)$
and a linear time complexity. Both linear and nonlinear
kernels can be easily applied to NESVM. In each iteration
round of NESVM, two auxiliary optimizations are constructed
and a weighted sum of their solutions is adopted
as the current SVM solution, in which the current gradient
and the historical gradients are combined to determine the
descent direction. The step size is automatically determined
by the Lipschitz constant of the objective. Two matrix-vector
multiplications are required in each iteration round.



We propose an accelerated NESVM, i.e., homotopy
NESVM, to improve the efficiency of NESVM when an accurate
approximation of the hinge loss or the $\ell_1$ norm is required.
Homotopy NESVM solves a series of NESVM problems with a decreasing
smooth parameter $\mu$, and the solution of each NESVM is
adopted as the "warm start" of the next NESVM. The time
cost caused by a small $\mu$ and a starting point far from
the solution can be significantly reduced by using homotopy
NESVM.

The experiments on various applications indicate that
NESVM achieves competitive efficiency compared
against four popular SVM solvers, i.e., SVM-Perf, Pegasos,
SVM-Light and LIBSVM, and that it is insensitive to $C$ and the
size of the dataset. NESVM can be further studied in many areas.
For example, it can be refined to handle
sparse features in document classification. Its efficiency can
be further improved by introducing parallel computation.
Because the gradients of the smoothed hinge loss and the
smoothed $\ell_1$ norm are already obtained, NESVM can be
further accelerated by extending it to online learning or
stochastic gradient algorithms. These directions will be studied
in our future work.